[1]Agrawal, S. and Goyal, N. 2012. Thompson sampling for contextual bandits with linear payoffs. CoRR abs/1209.3352 (2012).
[2]Auer, P., Cesa-Bianchi, N. and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 2-3 (May 2002), 235–256.
[3]Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. 2013. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics. 41, 3 (Jun. 2013), 1516–1541.
[4]Chapelle, O. and Li, L. 2011. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems 24 (NIPS 2011).
[5]Filippi, S., Cappé, O., Garivier, A. and Szepesvári, C. 2010. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems 23. J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, eds. 586–594.
[6]Lai, T.L. and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics. 6, 1 (1985), 4–22.
[7]Li, L., Chu, W. and Langford, J. 2010. An unbiased, data-driven, offline evaluation method of contextual bandit algorithms. CoRR abs/1003.5956 (2010).
[8]Li, L., Chu, W., Langford, J. and Schapire, R.E. 2010. A contextual-bandit approach to personalized news article recommendation. CoRR abs/1003.0146 (2010).
[9]Precup, D., Sutton, R.S. and Singh, S.P. 2000. Eligibility traces for off-policy policy evaluation. Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000) (San Francisco, CA, USA, 2000), 759–766.
[10]Thompson, W.R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 25, 3/4 (Dec. 1933), 285–294.
[11]Yahoo! Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-v1_0.